In this project, we examine the growth in the use of the term “diversity”. To do this, we drew from the MEDLINE database in Web of Science, using the search term “TS=(diversity)” for 1990-2018 and restricting to human research. This search returned 71,528 results, which we extracted using the bibliometrix package in R. Next, we converted the abstracts of these articles to a text corpus and used tidytext (a package designed for computational text analysis in R) to analyze patterns within the abstracts. Below is the replication code for these analyses…

In this first chunk, we load our data and examine the overall growth of articles from our search query. As we can see, the number of articles using the term diversity grows roughly tenfold, from about 500 in 1990 to over 5,000 in 2017. The mild drop in publications in 2018 is likely the result of incomplete 2018 records in the Web of Science database, and it recurs throughout our analyses.

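# packages used throughout this document
library(tidyverse)   # read_csv, dplyr, tidyr, ggplot2
library(tidytext)    # unnest_tokens, stop_words
library(widyr)       # pairwise_count
library(igraph)      # graph_from_data_frame
library(ggraph)      # network plots
library(plotly)      # ggplotly
library(stringi)     # stri_trim_both

text_data <- read_csv("text_data.csv")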
## Parsed with column specification:
## cols(
##   id = col_double(),
##   author = col_character(),
##   title = col_character(),
##   publication = col_character(),
##   abstract = col_character(),
##   year = col_double(),
##   department = col_character(),
##   subject = col_character(),
##   grant_information = col_character(),
##   keyword = col_character(),
##   pubmed_id = col_character(),
##   doi = col_character(),
##   country = col_character()
## )
# checking to see how the overall data looks 
by_year <- text_data %>% 
  count(year)
by_year_plot <- ggplot(by_year, aes(x = year, y = n)) + geom_line() + 
  labs(title = "overall growth in diversity-related articles from 1990-2018") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())
ggplotly(by_year_plot)
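The growth described above can also be expressed as a fold change relative to the first year. The following is a minimal base-R sketch on made-up data; `toy_years` and its values are illustrative assumptions, not figures from our dataset.

```r
# Toy publication years (hypothetical values, not the real data)
toy_years <- c(1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992, 1992, 1992, 1992)

# Count articles per year
counts <- as.data.frame(table(year = toy_years), stringsAsFactors = FALSE)
names(counts) <- c("year", "n")
counts$year <- as.integer(counts$year)

# Fold change relative to the earliest year
counts <- counts[order(counts$year), ]
counts$fold <- counts$n / counts$n[1]
print(counts)
```

Applied to the real yearly counts, the same ratio would put a number on the growth that the line graph shows visually.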

Next, we look at word frequencies by year. This chunk tokenizes the abstracts into words and counts how often each word occurs. Note that we also remove some frequently occurring tokens that are not relevant to our analysis (single digits and publisher copyright boilerplate); removing them does not systematically alter our results. You should be able to click through these tables to gain more insight.

# tokenizing the abstract data into words 
abstract_data <- text_data %>% 
  unnest_tokens(word, abstract) %>% 
  anti_join(stop_words)
## Joining, by = "word"
# most frequent word count in abstracts 
abstract_data %>%
  count(word, sort = TRUE)
## # A tibble: 159,784 x 2
##    word          n
##    <chr>     <int>
##  1 diversity 77112
##  2 study     34721
##  3 patients  32787
##  4 human     32166
##  5 results   29482
##  6 1         27696
##  7 genetic   27652
##  8 health    25787
##  9 cell      25463
## 10 analysis  25431
## # … with 159,774 more rows
# adding custom set of stopwords 
my_stopwords <- tibble(word = c(as.character(1:10), 
                                "rights", "reserved", 
                                "copyright", "elsevier"))
abstract_data <- abstract_data %>% anti_join(my_stopwords)
## Joining, by = "word"
# looking at word frequencies by year 
abstract_words <- abstract_data %>%
  group_by(year) %>% 
  count(word, sort = TRUE) %>% ungroup(); abstract_words
## # A tibble: 690,947 x 3
##     year word          n
##    <dbl> <chr>     <int>
##  1  2017 diversity  6399
##  2  2016 diversity  6154
##  3  2015 diversity  5875
##  4  2018 diversity  5813
##  5  2014 diversity  5129
##  6  2013 diversity  5036
##  7  2012 diversity  4693
##  8  2011 diversity  4055
##  9  2010 diversity  3794
## 10  2009 diversity  3336
## # … with 690,937 more rows
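To make the tokenizing step concrete, here is a small base-R sketch of roughly what `unnest_tokens()` followed by `anti_join()` does: lowercase the text, split it into words, and drop stop words. It is an approximation on hypothetical text, not the tidytext implementation, and the `stops` vector is an illustrative stand-in for tidytext's `stop_words`.

```r
# Toy abstracts (hypothetical text)
toy <- data.frame(
  id = c(1, 2),
  abstract = c("Genetic diversity in 2 human populations.",
               "Diversity of the gut microbiome. Copyright Elsevier."),
  stringsAsFactors = FALSE
)

# Small illustrative stop-word list (the real analysis uses tidytext's stop_words)
stops <- c("in", "of", "the", "2", "copyright", "elsevier")

# Lowercase, split on anything that is not a letter or digit, drop empty strings
tokens <- strsplit(tolower(toy$abstract), "[^a-z0-9]+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

# One row per (id, word), similar in shape to what unnest_tokens() returns
long <- data.frame(
  id = rep(toy$id, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Dropping stop words plays the role of anti_join(stop_words)
long <- long[!long$word %in% stops, ]

# Word frequencies, most common first
freq <- sort(table(long$word), decreasing = TRUE)
print(freq)
```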

Now, we want to look at how the most relevant words vary over time. Brandon chose to include words like diversity, genetic, and population as well as race- and ethnicity-specific terms. As we see, the rise of diversity does not necessarily mean that attention to race or ethnicity is growing in step with that term. This could mean that diversity is being used as a catch-all in the scientific literature (i.e., the term has so many meanings that it can stand for almost anything) or that diversity is most used in fields like immunology or oncology. We will explore the latter hypothesis a bit more below.

diversity_terms <- abstract_words %>% 
  filter(word %in% c("diversity", "genetic", "population", 
                     "ethnic", "racial", "race", 
                     "caucasian", "african", "black")) 
diversity_terms
## # A tibble: 261 x 3
##     year word          n
##    <dbl> <chr>     <int>
##  1  2017 diversity  6399
##  2  2016 diversity  6154
##  3  2015 diversity  5875
##  4  2018 diversity  5813
##  5  2014 diversity  5129
##  6  2013 diversity  5036
##  7  2012 diversity  4693
##  8  2011 diversity  4055
##  9  2010 diversity  3794
## 10  2009 diversity  3336
## # … with 251 more rows
word_graph <- ggplot() + geom_line(aes(y = n, x = year, colour = word),
                     data = diversity_terms, stat="identity") + 
  labs(title = "growth in the use of diversity-related terms over time") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())

interactive_graph <- ggplotly(word_graph); interactive_graph
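Raw counts can rise simply because more abstracts are published each year, so one possible complement to the graph above is to look at each word's share of all tokens in a given year. The sketch below uses base R on hypothetical numbers; it is a suggestion for a robustness check, not part of the analysis reported here, and `toy_words` is an assumed stand-in for a table shaped like `abstract_words`.

```r
# Toy per-year word counts (hypothetical values)
toy_words <- data.frame(
  year = c(1990, 1990, 1990, 2017, 2017, 2017),
  word = c("diversity", "racial", "cell", "diversity", "racial", "cell"),
  n    = c(10, 2, 88, 100, 20, 880),
  stringsAsFactors = FALSE
)

# Total number of tokens per year
totals <- aggregate(n ~ year, data = toy_words, FUN = sum)
names(totals)[2] <- "total"

# Share of each word among all tokens in that year
rel <- merge(toy_words, totals, by = "year")
rel$share <- rel$n / rel$total
print(rel[rel$word %in% c("diversity", "racial"), ])
```

In this toy example the raw count of "diversity" grows tenfold while its share of tokens stays flat, which is the kind of difference the normalization is meant to expose.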

We were also interested in the terms that most commonly occurred alongside diversity. We can examine this by running pairwise counts across the abstracts and then using network analysis to map those relationships. In the graph presented below, the nodes correspond to commonly occurring words in our abstract dataset, and the width and opacity of the line between two nodes reflect how often those words appear in the same abstract. This graph may still need some editing to remove terms that are not theoretically relevant.

# co-occurrence count of abstracts 
abstract_pairs <- abstract_data %>% 
  pairwise_count(word, id, sort = TRUE, upper = FALSE)
abstract_pairs
## # A tibble: 62,711,759 x 3
##    item1     item2        n
##    <chr>     <chr>    <dbl>
##  1 diversity results  20505
##  2 diversity study    19995
##  3 diversity human    15952
##  4 diversity analysis 15309
##  5 diversity data     13729
##  6 diversity genetic  13281
##  7 diversity based    12362
##  8 diversity studies  12099
##  9 diversity methods  12070
## 10 results   study    11859
## # … with 62,711,749 more rows
# network visualization of most frequent pairs 
set.seed(1234)
abstract_pairs %>%
  filter(n >= 5000) %>%   # may need to alter this number for a cutoff point 
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) + theme_void()
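The pairwise counting step can be illustrated on a tiny example. The base-R sketch below counts, for each unordered pair of words, how many abstracts contain both; this is the idea behind `pairwise_count(word, id, upper = FALSE)`, though it is not widyr's implementation, and the data are hypothetical.

```r
# Toy tokenized abstracts: one row per (id, word); hypothetical values
toy <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  word = c("diversity", "genetic", "study",
           "diversity", "study",
           "diversity", "genetic", "health"),
  stringsAsFactors = FALSE
)

# For each abstract, list every unordered pair of distinct words
pairs_per_doc <- lapply(split(toy$word, toy$id), function(w) {
  w <- sort(unique(w))
  if (length(w) < 2) return(NULL)
  t(combn(w, 2))
})
all_pairs <- do.call(rbind, pairs_per_doc)

# Count how many abstracts contain each pair
pair_counts <- as.data.frame(table(item1 = all_pairs[, 1], item2 = all_pairs[, 2]),
                             stringsAsFactors = FALSE)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
pair_counts <- pair_counts[order(-pair_counts$Freq), ]
print(pair_counts)
```

The `filter(n >= 5000)` cutoff in the network plot then amounts to keeping only the rows of such a table whose count exceeds a chosen threshold.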

The next step is to look at how the concept of “diversity” is used across the world. As the graph below demonstrates, the use of “diversity” grows mostly in predominantly White, Westernized countries like the United States, England, the Netherlands, Germany and Switzerland.

text_data$country <- tolower(text_data$country)

# tokenizing the abstract data into words 
abstract_data <- text_data %>% 
  unnest_tokens(word, abstract) %>% 
  anti_join(stop_words)
## Joining, by = "word"
# most frequent word count in abstracts 
abstract_data %>%
  count(word, sort = TRUE)
## # A tibble: 159,784 x 2
##    word          n
##    <chr>     <int>
##  1 diversity 77112
##  2 study     34721
##  3 patients  32787
##  4 human     32166
##  5 results   29482
##  6 1         27696
##  7 genetic   27652
##  8 health    25787
##  9 cell      25463
## 10 analysis  25431
## # … with 159,774 more rows
# looking at word frequencies by year and country 
diversity_by_country <- abstract_data %>%
  group_by(year) %>% 
  count(word, country, sort = TRUE) %>% ungroup(); diversity_by_country
## # A tibble: 1,528,401 x 4
##     year word      country           n
##    <dbl> <chr>     <chr>         <int>
##  1  2017 diversity united states  2781
##  2  2016 diversity united states  2753
##  3  2015 diversity united states  2635
##  4  2014 diversity united states  2464
##  5  2013 diversity united states  2401
##  6  2012 diversity united states  2324
##  7  2018 diversity united states  2312
##  8  2017 diversity england        2261
##  9  2018 diversity england        2005
## 10  2016 diversity england        1982
## # … with 1,528,391 more rows
diversity_by_country <- diversity_by_country %>% 
  filter(word == "diversity") 
diversity_by_country
## # A tibble: 1,044 x 4
##     year word      country           n
##    <dbl> <chr>     <chr>         <int>
##  1  2017 diversity united states  2781
##  2  2016 diversity united states  2753
##  3  2015 diversity united states  2635
##  4  2014 diversity united states  2464
##  5  2013 diversity united states  2401
##  6  2012 diversity united states  2324
##  7  2018 diversity united states  2312
##  8  2017 diversity england        2261
##  9  2018 diversity england        2005
## 10  2016 diversity england        1982
## # … with 1,034 more rows
diversity_over_time <- ggplot() + geom_line(aes(y = n, x = year, colour = country),
                     data = diversity_by_country, stat="identity") +
  labs(title = "growth in the use of diversity over time (by country)") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())

diversity_over_time <- ggplotly(diversity_over_time); diversity_over_time
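With several dozen countries in one plot, the lines for smaller countries are hard to read. One option, sketched below in base R on hypothetical counts, is to keep only the countries with the largest totals before plotting. The data frame `toy` and the cutoff `top_k` are illustrative assumptions, not values used in our analysis.

```r
# Toy counts of "diversity" by year and country (hypothetical values)
toy <- data.frame(
  year    = rep(c(2016, 2017), each = 4),
  country = rep(c("united states", "england", "germany", "peru"), times = 2),
  n       = c(50, 40, 10, 1, 60, 45, 12, 2),
  stringsAsFactors = FALSE
)

# Total mentions per country across all years
totals <- aggregate(n ~ country, data = toy, FUN = sum)

# Keep the top_k countries by total count (top_k is an arbitrary choice)
top_k <- 2
top_countries <- totals$country[order(-totals$n)][seq_len(top_k)]
toy_top <- toy[toy$country %in% top_countries, ]
print(toy_top)
```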

Lastly, we wanted to look more into the growth of diversity-related terms by scientific subject matter. This snippet of code counts the words occurring in abstracts over time and then breaks those counts down by the subject categories assigned in MEDLINE and Web of Science. I have opted to include only 13 of the 134 categories that could have been graphed here. Overall, we see that the rise of diversity is concentrated in genetics & heredity, biochemistry & molecular biology, microbiology, immunology, and infectious disease research. We do not see the same rise in the social sciences, though there admittedly is some overall growth in that domain.

text_data <- text_data %>% 
  separate(subject, into = paste("subject", 1:15, sep = "_"), sep = ";") %>%
  gather(value, subject, subject_1:subject_15, na.rm = TRUE) %>% select(-value)

text_data <- text_data %>% 
  separate(subject, into = c("subject", "void"), sep = "[(]") %>% select(-void) 

text_data$subject <- stri_trim_both(text_data$subject)
text_data$subject <- tolower(text_data$subject)
# unique(text_data$subject) 

subject_data <- text_data %>% 
  unnest_tokens(word, abstract) %>% 
  anti_join(stop_words)

subject_data %>% 
  count(word, sort = TRUE)
## # A tibble: 159,751 x 2
##    word           n
##    <chr>      <int>
##  1 diversity 437762
##  2 study     217375
##  3 patients  214820
##  4 1         180957
##  5 human     180386
##  6 results   176187
##  7 health    165982
##  8 genetic   157316
##  9 analysis  151810
## 10 isolates  151303
## # … with 159,741 more rows
growth_by_subject <- subject_data %>%
  group_by(year) %>% 
  count(word, subject, sort = TRUE) %>% ungroup()  

subject_data %>% 
  count(subject, sort = TRUE)
## # A tibble: 134 x 2
##    subject                                n
##    <chr>                              <int>
##  1 genetics & heredity              4213638
##  2 biochemistry & molecular biology 4016585
##  3 microbiology                     2216691
##  4 immunology                       1986325
##  5 infectious diseases              1857374
##  6 pharmacology & pharmacy          1466127
##  7 behavioral sciences              1464061
##  8 psychology                       1356911
##  9 pediatrics                       1344591
## 10 cell biology                     1321172
## # … with 124 more rows
graph_by_subject <- growth_by_subject %>% 
  filter(word == "diversity", 
         subject %in% c("genetics & heredity", "biochemistry & molecular biology", 
                        "microbiology", "infectious diseases", "immunology", 
                        "pharmacology & pharmacy", "behavioral sciences", 
                        "health care sciences & services", "neurosciences & neurology", 
                        "psychology", "sociology", 
                        "oncology", "business & economics"))

graph_by_subject <- ggplot() + geom_line(aes(y = n, x = year, colour = subject),
                     data = graph_by_subject, stat="identity") +
  labs(title = "growth in the use of diversity over time (by subject)") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())

graph_by_subject <- ggplotly(graph_by_subject); graph_by_subject
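The subject-cleaning step at the top of this chunk turns each multi-valued subject field into one row per subject. The base-R sketch below reproduces that idea on hypothetical records: split on ";", drop any parenthetical suffix, trim, and lowercase. It mirrors the intent of the `separate()` and `gather()` calls but is not the same code, and the example strings are invented.

```r
# Toy records with multi-valued subject fields (hypothetical values)
toy <- data.frame(
  id      = c(1, 2),
  subject = c("Genetics & Heredity; Microbiology (provided by X)",
              "Psychology"),
  stringsAsFactors = FALSE
)

# Split on ";" so each subject gets its own row
parts <- strsplit(toy$subject, ";", fixed = TRUE)
long <- data.frame(
  id      = rep(toy$id, lengths(parts)),
  subject = unlist(parts),
  stringsAsFactors = FALSE
)

# Drop anything from the first "(" onward, then trim and lowercase
long$subject <- tolower(trimws(sub("[(].*$", "", long$subject)))
print(long)

# Article 1 now appears twice, once per subject
print(table(long$id))
```

Because an article with several subjects is repeated once per subject, word counts in this section are larger than in the earlier sections (for example, "diversity" appears 437,762 times here versus 77,112 times before the split), so the totals should be compared across subjects and not against the earlier tables.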

Overall, this document shows a rise in the use of “diversity” across scientific research. We see a roughly 10-fold increase between 1990 and 2017, which mostly unfolds in Westernized biomedical research. Our future analyses will examine what implications this has for the use of diversity in and outside of that domain.